We construct a phenomenological effective field theory model that describes the universality class
of biologically active single-strand proteins. The model allows both for an explicit construction of 
native state protein conformations, and a dynamical description of protein folding and unfolding 
processes. The model reveals a connection between homochirality and protein collapse, and enables 
the theoretical investigation of various other aspects of protein folding even in the case of very long 
polypeptide chains where other methods are not available. 
Various techniques have been developed for the theoretical analysis of protein folding [1], which is arguably the most
important problem in molecular biology. In particular, all-atom simulations that employ very accurate semi-empirical
potential energy functions facilitate, at least in principle, a high resolution description of the folding process. But
computationally the problem is NP-hard, and thus far these very powerful methods have been limited to proteins that
have only a relatively small degree of polymerization [1].

Here we present a phenomenological effective field theory model that describes in a very realistic manner the
dynamical details of protein folding, even in the case of very long polypeptide chains. Effective field theory models
are often employed, sometimes with great success, to address complicated problems when the exact theoretical
principles are either unknown or have a structure that is too complex for analytic or numerical treatment.
Familiar examples of powerful and predictive effective field theory models include the Ginzburg-Landau approach to
superconductivity [2] and the Skyrme model of atomic nuclei [3].

In polymer physics field theory techniques became popular after de Gennes [4], [5] showed the equivalence between
the self-avoiding random walk and the $N \to 0$ limit of the $O(N)$ symmetric $(\phi^2)^2$ scalar field theory, and
proposed that polymer collapse can be modelled by including an additional $(\phi^2)^3$ self-interaction. This approach
is very powerful in characterizing the critical properties of polymers. But to our knowledge there are no effective
field theory models that detail the dynamics of protein folding and explicitly describe the native state conformations
of proteins.

Here we present a phenomenological effective field theory that resides in the same universality class as biologically
active proteins. The model enables both the construction of realistic three dimensional protein conformations and
a detailed analysis of the dynamics of folding and unfolding processes. This makes our model a particularly valuable
tool in the study of phenomenological aspects of protein folding. In particular, our model can describe the details of
protein folding even in the case of very long polypeptide chains where the semi-empirical all-atom simulations still
have a long way to go.

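The compactness index $\nu$ that describes how the radius of gyration $R_g$ scales in the number of central carbons $N$,

$$R_g = \sqrt{\frac{1}{2N^2} \sum_{i,j} (\mathbf{r}_i - \mathbf{r}_j)^2} \approx L N^{\nu}, \qquad (1)$$

is a universal quantity in the limit of large $N$ [5]. Here $\mathbf{r}_i$ ($i = 1, 2, \ldots, N$) are the locations of the central carbons and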
$L$ is a form factor that characterizes an effective distance between monomers. At high temperatures we expect that
$\nu$ quite universally approaches the Flory value [5] $\nu \approx 3/5$ that corresponds to the universality class of the
self-avoiding random walk; Monte Carlo estimates refine this to $\nu \approx 0.588\ldots$ [6]. On the other hand, for
biologically active proteins an analysis of the data in the Protein Data Bank [7] yields an estimate $\nu \approx 2/5$ [8].
This is in line with the widely held view [9] that native state proteins are in the $\nu = 1/3$ universality class of
compact matter. But to our knowledge none of the available theoretical models of protein folding has until now been
able to accurately describe the $\nu \approx 2/5$ (or $\nu \approx 1/3$) scaling law of biologically active proteins.
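
As a minimal sketch of how the radius of gyration can be evaluated for a given set of central carbon positions, the following Python function assumes that the coordinates are available as an N x 3 NumPy array. The function name `radius_of_gyration` and the straight-rod test geometry (with an illustrative spacing of 3.8) are choices made here for illustration and are not taken from the paper.

```python
import numpy as np

def radius_of_gyration(r):
    """Root-mean-square distance of the points r (shape (N, 3)) from their centroid."""
    r = np.asarray(r, dtype=float)
    centered = r - r.mean(axis=0)
    # Identical to sqrt( (1 / (2 N^2)) * sum_{i,j} |r_i - r_j|^2 ).
    return float(np.sqrt(np.mean(np.sum(centered ** 2, axis=1))))

# Illustrative geometry only: a straight rod of 100 sites spaced 3.8 apart.
n_sites = 100
rod = np.zeros((n_sites, 3))
rod[:, 0] = 3.8 * np.arange(n_sites)
rg_rod = radius_of_gyration(rod)
print(f"R_g of a straight rod: {rg_rod:.2f}")
```

For a straight rod the radius of gyration grows linearly in N, which is the fully extended limit; compact conformations correspond to a much slower growth.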

Here we show that the scaling law of biologically active proteins is computed by the free energy of a discrete version
of the two dimensional Abelian Higgs model with an $O(2) \simeq U(1)$ symmetric Higgs field, originally introduced to
describe superconductivity [2]. The variant considered here was employed in [10] to embed string-like configurations
in three dimensional space; the $U(1)$ gauge invariance originates from the requirement that the physical properties of
a string must be independent of the choice of local frames in the normal planes. This principle of gauge invariance
leads us to an essentially unique free energy to describe folding proteins,

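$$F = -\sum_{i,j=1}^{N} a_{ij} \left\{ 1 - \cos[\omega_{ij}(\kappa_i - \kappa_j)] \right\} + \sum_{i=1}^{N} \left\{ b_i \kappa_i^2 \tau_i^2 + c_i (\kappa_i^2 - \mu_i^2)^2 \right\} + \sum_{i=1}^{N} d_i \tau_i. \qquad (2)$$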

Here $i, j = 1, \ldots, N$ label the central carbon atoms along the protein backbone. The variable $\kappa_i$ corresponds
to the gauge invariant signed modulus of the Higgs field, and $\tau_i$ is the space component of the gauge invariant
supercurrent [5]. The first term describes long-distance correlations; it is responsible for the derivative term of the
Higgs field in the continuum limit. We have introduced the cosine function to tame excessive fluctuations in $\kappa_i$.
The middle term describes the interaction between $\kappa_i$ and $\tau_i$, and the symmetry breaking self-interaction of
$\kappa_i$. Finally, the last term is a discretized one-dimensional version of the Chern-Simons functional [11]. Its
presence provides a simple model for the observed homochirality of biologically active proteins: a positive (negative)
parameter $d_i$ gives rise to left-handed (right-handed) chirality.
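
The three terms described above can be sketched numerically as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes a cosine coupling between neighbouring curvatures entering with an overall minus sign, an on-site potential of the form b κ²τ² + c (κ² − μ²)², and a Chern-Simons term linear in the torsion, all with site-independent parameters and nearest-neighbour coupling only. The function name `free_energy` and the parameter values are placeholders.

```python
import numpy as np

def free_energy(kappa, tau, a, omega, b, c, mu, d):
    """Free energy of a homogeneous chain with nearest-neighbour coupling only.

    ASSUMED form (the overall sign of the first term is an assumption):
      F = -sum_ij a_ij {1 - cos[omega (k_i - k_j)]}
          + sum_i [b k_i^2 t_i^2 + c (k_i^2 - mu^2)^2] + sum_i d t_i,
    with a_ij = a for |i - j| = 1 and zero otherwise.
    """
    kappa = np.asarray(kappa, dtype=float)
    tau = np.asarray(tau, dtype=float)
    # Each bond (i, i+1) appears twice in the double sum over i and j.
    bond = 1.0 - np.cos(omega * np.diff(kappa))
    nonlocal_term = -2.0 * a * np.sum(bond)
    # Curvature-torsion coupling plus symmetry-breaking potential for kappa.
    onsite_term = np.sum(b * kappa**2 * tau**2 + c * (kappa**2 - mu**2) ** 2)
    # Discretized Chern-Simons term: linear in the torsion.
    chern_simons_term = d * np.sum(tau)
    return float(nonlocal_term + onsite_term + chern_simons_term)

# Placeholder parameter values, chosen only to exercise the function.
params = dict(a=1.0, omega=1.0, b=1.0, c=1.0, mu=1.5, d=0.2)
straight = free_energy(np.zeros(10), np.zeros(10), **params)
print(straight)  # equals 10 * c * mu**4 for the straight, untwisted chain
```

Setting `d=0.0` in such a sketch switches off the chirality-breaking term while leaving the other two contributions untouched.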

We relate the variables in (2) to the protein backbone geometry as follows: The Higgs field $\kappa_i$ describes the signed
Frenet curvature of the backbone at the site $i$, and $\tau_i$ is the corresponding frame independent Frenet torsion. Once
the numerical values of $\kappa_i$ and $\tau_i$ are known, the geometric shape of the backbone in the three dimensional space
$\mathbb{R}^3$ is obtained by integrating a discretized version of the Frenet equations [13]. This integration also
introduces parameters $\Delta_i$, the average finite lengths of the peptide bonds.
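
One possible discretization of the Frenet equations is sketched below. The paper does not spell out its scheme, so the choice made here is an assumption: the Frenet frame (tangent, normal, binormal) is advanced from site to site by the exact rotation generated by the skew-symmetric Frenet matrix, with the curvature and torsion interpreted as rotation angles per bond, and each new position is placed one bond length along the updated tangent. The names `step_rotation` and `build_backbone` are illustrative.

```python
import numpy as np

def step_rotation(kappa, tau):
    """Rotation exp(A) for the skew-symmetric Frenet generator A(kappa, tau)."""
    gen = np.array([[0.0, kappa, 0.0],
                    [-kappa, 0.0, tau],
                    [0.0, -tau, 0.0]])
    angle = np.hypot(kappa, tau)
    if angle < 1e-12:
        return np.eye(3)
    # Rodrigues formula for the exponential of a 3x3 skew-symmetric matrix.
    return (np.eye(3)
            + (np.sin(angle) / angle) * gen
            + ((1.0 - np.cos(angle)) / angle**2) * (gen @ gen))

def build_backbone(kappa, tau, bond_length):
    """Positions of N sites from per-site curvature and torsion.

    The rows of `frame` are the tangent, normal and binormal vectors.
    The (kappa, tau) values of the first site are not used by this scheme.
    """
    n_sites = len(kappa)
    frame = np.eye(3)
    r = np.zeros((n_sites, 3))
    for i in range(1, n_sites):
        frame = step_rotation(kappa[i], tau[i]) @ frame
        r[i] = r[i - 1] + bond_length * frame[0]
    return r

# Constant curvature and torsion give a helix; the values are placeholders.
r = build_backbone(np.full(50, 0.9), np.full(50, 0.4), bond_length=3.8)
bonds = np.linalg.norm(np.diff(r, axis=0), axis=1)
print(bonds.min(), bonds.max())
```

Because every update is a rotation, the tangent stays a unit vector and all bonds have the same length by construction.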

The quantities $a_{ij}$, $\omega_{ij}$, $b_i$, $c_i$, $\mu_i$, $d_i$ are free parameters, and different values of these
parameters describe different kinds of amino acid structures. For simplicity we consider here only the nearest neighbor
interactions

$$a_{ij} = \begin{cases} a\,(\delta_{i,j+1} + \delta_{i,j-1}) & (i = 2, \ldots, N-1) \\ a & (i = 1,\ j = 2)\ \&\ (i = N,\ j = N-1) \end{cases} \qquad (3)$$

For simplicity we also select all the remaining parameters to be independent of the site index i. Our choice corresponds 
to a homogeneous protein backbone. 
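
For a homogeneous backbone the on-site part of the free energy depends on a single pair $(\kappa, \tau)$, and its stationary points can be located directly. The short derivation below assumes that the middle and last terms of the free energy take the form $b\kappa^2\tau^2 + c(\kappa^2 - \mu^2)^2 + d\tau$ per site, which is the form suggested by the textual description of these terms and by note [12]; it is meant as a consistency check rather than as part of the original argument.

```latex
\begin{align*}
V(\kappa,\tau) &= b\,\kappa^2\tau^2 + c\,(\kappa^2-\mu^2)^2 + d\,\tau, \\
\frac{\partial V}{\partial \tau} &= 2b\,\kappa^2\tau + d = 0
  \;\Rightarrow\; \tau = -\frac{d}{2b\,\kappa^2}, \\
\frac{\partial V}{\partial \kappa} &= 2\kappa\left[\,b\,\tau^2 + 2c\,(\kappa^2-\mu^2)\right] = 0
  \;\Rightarrow\; \kappa = 0 \ \text{ or } \ \kappa^2 = \mu^2 - \frac{b\,\tau^2}{2c}.
\end{align*}
```

A nonzero solution for $\kappa$ requires $b\tau^2 < 2c\mu^2$ (for $c > 0$), and for small $d$ it lies near $\kappa \approx \pm\mu$, $\tau \approx -d/(2b\mu^2)$. At $\kappa = 0$ the potential reduces to $c\mu^4 + d\tau$, which is linear in $\tau$ and therefore unbounded from below.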

We fold the protein iteratively, by free energy minimization. At each iteration step we first generate a new set of
values for the curvature and torsion $(\kappa_i, \tau_i)$ using the Metropolis algorithm [14] with a finite Metropolis
temperature-like parameter $T_M$ (that can be related to the actual physical temperature by an appropriate scaling).
We then construct a new protein backbone by solving the discrete Frenet equations with a fixed and uniform peptide bond
length $\Delta$,

$$|\mathbf{r}(s_i) - \mathbf{r}(s_{i-1})| = \Delta, \qquad i = 2, \ldots, N. \qquad (4)$$

Finally, before accepting the new protein backbone we exclude steric clashes by demanding that the distance between
any two central carbon atoms in the new backbone satisfies the bound

$$|\mathbf{r}(s_i) - \mathbf{r}(s_j)| > z \quad \text{for} \quad |i - j| > 2. \qquad (5)$$

We note that in a native state protein it is quite common for $z$ to acquire values that are of the order of, say, 10%
smaller than $\Delta$.
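
The iteration described above (propose new curvature and torsion values, rebuild the backbone, reject steric clashes) can be sketched as follows. Several details are assumptions made for illustration: the proposal changes one randomly chosen site by a Gaussian increment, the steric bound is applied to pairs with |i − j| > 2, and the energy and backbone construction are passed in as callables. The functions `toy_energy` and `toy_backbone`, as well as the values 3.8 for the bond length and 0.5 for the Metropolis temperature, are hypothetical stand-ins that only serve to make the sketch run on its own.

```python
import numpy as np

def has_steric_clash(r, z, min_separation=3):
    """True if any two sites with |i - j| >= min_separation are within distance z."""
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    i, j = np.triu_indices(len(r), k=min_separation)
    return bool(np.any(dist[i, j] <= z))

def metropolis_step(kappa, tau, energy_fn, backbone_fn, temperature, z, rng, step=0.1):
    """Propose a change of (kappa_i, tau_i) at one random site; accept or reject it."""
    site = rng.integers(len(kappa))
    new_kappa, new_tau = kappa.copy(), tau.copy()
    new_kappa[site] += step * rng.normal()
    new_tau[site] += step * rng.normal()
    delta_f = energy_fn(new_kappa, new_tau) - energy_fn(kappa, tau)
    # Metropolis rule: always accept downhill moves, uphill with Boltzmann weight.
    if delta_f > 0.0 and rng.random() >= np.exp(-delta_f / temperature):
        return kappa, tau, False
    # Steric filter applied to the rebuilt backbone before final acceptance.
    if has_steric_clash(backbone_fn(new_kappa, new_tau), z):
        return kappa, tau, False
    return new_kappa, new_tau, True

# Hypothetical stand-ins (NOT the model of the paper) so the sketch is runnable.
def toy_energy(kappa, tau):
    return float(np.sum((kappa**2 - 1.0) ** 2 + kappa**2 * tau**2))

def toy_backbone(kappa, tau, bond_length=3.8):
    heading = np.cumsum(kappa)  # planar chain: kappa acts as a turning angle
    steps = bond_length * np.column_stack(
        [np.cos(heading), np.sin(heading), np.zeros_like(heading)])
    return np.cumsum(steps, axis=0)

rng = np.random.default_rng(0)
kappa, tau = np.zeros(30), np.zeros(30)  # straight, untwisted start
accepted = 0
for _ in range(2000):
    kappa, tau, ok = metropolis_step(kappa, tau, toy_energy, toy_backbone,
                                     temperature=0.5, z=0.9 * 3.8, rng=rng)
    accepted += ok
print("accepted moves:", accepted)
```

Since a move is only kept when the rebuilt backbone passes the steric test, every configuration along the run satisfies the bound, provided the starting configuration does.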

Our simulations start from an initial configuration with $\kappa_i = \tau_i = 0$. This corresponds to a straight, untwisted
protein backbone. Since the initial Metropolis step is determined randomly, essentially by a thermal fluctuation, our
starting point has a large conformational entropy. Consequently we expect that statistically our final conformations
cover the landscape of native protein states.

The various parameters in (2) are not fully independent but can be related to each other by diverse scaling
transformations and changes of variables. We derive additional restrictions on these parameters by comparing the results
of our simulations to the properties of biological proteins. For example, in line with native state proteins we impose
the constraint that in a full $2\pi$ $\alpha$-helix turn there are on average about 3.6 central carbons. We have arrived
at the final parameter values used in our simulations after an extensive trial-and-error procedure, to obtain results
that are as close as possible to the universality class of biologically active proteins. A numerical survey around our
chosen parameter values suggests that they are optimal, at least locally in the space of parameters.
We have made extensive numerical simulations using configurations where the number $N$ of central carbon atoms
lies in the range $75 < N < 1,000$. For these configurations we typically arrive at a stable folded state after around
1,000,000 steps. The folding process takes no more than a few tens of seconds on a MacPro desktop computer, even for
the large values of $N$. But in order to ensure the stability of our final configurations we have extended our simulations
to 22,000,000 steps. Besides thermal fluctuations, we observe no essential change in the folded structures after the
initial 1,000,000 steps, which confirms that we have reached a native state.

In Figure 1 we have placed all biologically active single-stranded proteins that can be presently harvested from the
Protein Data Bank, with the number $N$ of central carbons in the range $75 < N < 1,000$. Using a least square
linear fit to the data we find for the compactness index the value $\nu_{PDB} = 0.378 \pm 0.0017$, which is in line with
the results previously reported in the literature [8], [9]. In Figure 1 we also show how the compactness index in our
model depends on $N$ when $75 < N < 1,000$, using a statistical sample of 80 runs for each value of $N$. When we apply
a least square linear fit to our results we find for the compactness index the estimate $\nu = 0.379 \pm 0.0081$, in
excellent agreement with the value obtained from the Protein Data Bank. This agreement confirms that our model
does describe the universality class of biologically active proteins. From the data in Figure 1 we find that our model
predicts for the form factor $L$ in (1) the numerical value $L = 2.656 \pm 0.049$ (Å). This compares well with the average
value $L_{PDB} = 2.254 \pm 0.021$ (Å) that we obtain using a least square fit to the Protein Data Bank data displayed in
Figure 1.

FIG. 1: Least square linear fit to the compactness index $\nu$ computed in our model ($\nu = 0.379 \pm 0.0081$) compared with
that describing all single strand proteins currently deposited at the Protein Data Bank ($\nu_{PDB} = 0.378 \pm 0.0017$).
The error-bars describe standard deviation from the average, a measure of conformational entropy in our initial
configuration.
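
A least square linear fit of this kind can be sketched as a straight-line fit of log R_g against log N, whose slope is the compactness index and whose intercept gives the form factor. Whether the fit in the paper was performed in exactly this way is an assumption. The data below are synthetic, generated from the model values quoted in the text purely to check that the fit recovers them; they are not the simulation or Protein Data Bank data.

```python
import numpy as np

def fit_compactness(n_values, rg_values):
    """Fit R_g = L * N**nu by linear least squares in log-log coordinates."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(rg_values), 1)
    return slope, np.exp(intercept)  # (nu, L)

# Synthetic, noise-free data built from the quoted model values (illustration only).
n_values = np.array([75, 150, 300, 600, 1000], dtype=float)
rg_values = 2.656 * n_values ** 0.379
nu_fit, l_fit = fit_compactness(n_values, rg_values)
print(f"nu = {nu_fit:.3f}, L = {l_fit:.3f}")
```

With real data, the scatter of the points around the fitted line determines the uncertainties on both fitted quantities.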

We observe from Figure 1 that the standard deviations displayed by our final conformations are comparable in
size to the actual spreading of biologically active proteins around their experimentally determined average values.
The standard deviation is a measure of conformational entropy, and consequently at each value of $N$ our initial
configuration appears to have enough conformational entropy for our model to cover the entire landscape of native
state protein folds.

We have verified that our value of $\nu$ is temperature independent for a wide range of temperatures: In our model $\nu$
is insensitive to an increase in the Metropolis temperature $T_M$ until $T_M$ reaches a critical value that we can
normalize to $T_c \approx 330$ (K). At this critical temperature there is an onset of a transition towards the
$\Theta$-point, and at the $\Theta$-point we estimate $\nu \approx 0.48$ to $0.49$, in line with the expected value
$\nu \approx 1/2$ that characterizes the universality class of a random coil. In the limit of high temperatures we find
$\nu \approx 0.65$, which is slightly above the Flory value $\nu = 3/5$ for a self-avoiding random walk. It appears to us
that the slight differences between our estimates and those expected on general grounds [5] are due to finite length
effects.

We have also studied the effect of the various operators in (2) in determining the universality class:

We find that the value $\nu \approx 0.379\ldots$ is entirely due to the presence of the chirality breaking Chern-Simons
term: When we remove the Chern-Simons term by setting $d_i = 0$ in (2) the compactness index increases to
$\nu \approx 0.488\ldots$, which is very close to the $\Theta$-point value $\nu \approx 1/2$. Thus our model proposes that
the folding of biologically active proteins is driven by their chirality.

When we in addition remove the direct coupling between torsion and curvature by setting $b_i = d_i = 0$ the
compactness index remains near its $\Theta$-point value $\nu \approx 0.488\ldots$.

When we remove the entire symmetry breaking potential by setting $c_i = 0$ in (2) we find that $\nu$ approaches the
value $\nu \approx 0.737\ldots$. This suggests that we may have a novel universality class in the low temperature phase.
Notice that in this case the local minima of the potential energy are absent.

When we set $a_{ij} = 0$ we find $\nu \approx 0.370\ldots$. Consequently the non-local coupling between curvatures appears
to have a tendency to (very slightly) increase $\nu$. We also find that in the absence of $a_{ij}$ the value of $L$ tends
to increase slightly, to $L \approx 2.96\ldots$.

Finally, at very high temperatures $T_M$, when we set $b_i = d_i = 0$ we find that the compactness index, as expected,
approaches the Flory value 3/5; we now get $\nu \approx 0.61\ldots$.

In Figure 2 we show, using an example with $N = 300$, how the compactness index $\nu$ evolves as a function of the
number of iterations ("time") during the first 1,000,000 steps. In this figure we also describe how the free energy
(2) develops as a function of the iteration steps. We find that while $\nu$ generically approaches its asymptotic value
$\nu \approx 0.379\ldots$ very rapidly, after only a few thousand iterations, the process of energy minimization typically
takes about two orders of magnitude longer. The asymptotic behaviour of the curves confirms that the final state is highly
(meta)stable. The stability is further validated by a comparison with Figure 1, where we report on results after the
iteration process has been continued by 21,000,000 additional steps: For $T_M < T_c \approx 330$ (K) we find no essential
change in the final conformations after about 1,000,000 steps, beyond thermal fluctuations.

FIG. 2: The red line shows how the compactness index $\nu$ typically evolves as a function of the number of iteration steps
(time) when computed as an average over statistical samples and up to 1,000,000 steps. The blue line shows similarly how
the average energy typically evolves as a function of the number of iteration steps.

Our interpretation of Figure 2 is that the folding process described by our model follows very closely the folding
process of biological proteins: The initial denatured state first rapidly collapses into a molten globule, with a large
decrease in conformational entropy but only a very small change in the internal energy. After the initial collapse to
the molten globule with the ensuing formation of secondary structures such as $\alpha$-helices and $\beta$-sheets, the
folding process continues with a relatively slow conformational re-arrangement towards a locally stable conformation.
The final state has a substantially lower energy than the corresponding molten globule state.

Finally, we have compared our native states to the hierarchical classification scheme CATH [15]. We find that
our conformations are in very good overall correspondence with this classification scheme. In particular, our model
appears to produce all the major secondary structures of biologically active proteins. But when we compare the
number of conformations that we produce in different classes with the data presently available in the Protein Data
Bank, we conclude that as it stands our model produces a statistical excess of native folds in the mainly-$\alpha$ class.
In order to produce a statistically larger proportion of folds in the $\alpha$-$\beta$ and in particular in the
mainly-$\beta$ class, we presumably need to consider a more involved nonlocal coupling (3).

In summary, we have developed a phenomenological effective field theory model that describes realistically the
folding dynamics and native state folds of protein backbones. Since our model folds even very long proteins within a
few tens of seconds on a desktop personal computer, it allows us to describe and analyze phenomena that are not yet
reachable by other theoretical means. The model computes accurately the compactness index $\nu$ of native state proteins,
and in particular it proposes that protein collapse is driven by homochirality, which is described by the Chern-Simons
functional. Furthermore, since the native states are in line with the CATH classification scheme, the model has great
promise to become an effective tool for producing conformational templates for the purpose of protein design and
engineering.

Our research is supported by grants from the Swedish Research Council (VR). The work by A.J.N. is also supported
by the Project Grant ANR NT05-142856. A.J.N. thanks H. Orland for discussions and advice. We all thank M.
Chernodub for discussions, and N. Johansson and J. Minahan for comments. A.J.N. also thanks T. Gregory Dewey
for communications. A.J.N. thanks the Aspen Center for Physics for hospitality during this work.



[1] L. Mirny and E. Shakhnovich, Annual Review of Biophysics and Biomolecular Structure 30 (2001) 361; H.A. Scheraga, M. Khalili and A. Liwo, Annual Review of Physical Chemistry 58 (2007) 57; M. Oliveberg and P.G. Wolynes, Quarterly Reviews of Biophysics 38 (2005) 245
[2] P.G. De Gennes, Superconductivity of Metals and Alloys (Westfield Press, New York, 1995)
[3] I. Zahed and G.E. Brown, Physics Reports 142 (1986) 1; T. Gisiger and M.B. Paranjape, Physics Reports 306 (1998) 109
[4] P.G. De Gennes, Physics Letters 38A (1972) 339
[5] P.G. De Gennes, Scaling Concepts in Polymer Physics (Cornell University Press, Ithaca, 1979)
[6] B. Li, N. Madras and A. Sokal, Journal of Statistical Physics 80 (1995) 661; N. Madras and G. Slade, The Self-Avoiding Walk (Birkhäuser, Berlin, 1996)
[7] H.M. Berman et al., Nucleic Acids Research 28 (2000) 235
[8] L. Hong and J. Lei, http://arxiv.org/abs/0711.3679v1
[9] T.G. Dewey, Journal of Chemical Physics 98 (1993) 2250
[10] A.J. Niemi, Physical Review D67 (2003) 106004
[11] S.-S. Chern and J. Simons, Annals of Mathematics 99 (1974) 48
[12] For each $i$ the potential energy in (2) is unbounded from below with a global minimum at $\kappa = 0$ and $\tau \to \infty$. But whenever $b\tau^2 < 2c\mu^2$ there are two symmetric local minima in the vicinity of $\kappa \approx \pm\mu$, $\tau \approx -d/(2b\mu^2)$. The folded proteins are soliton-like configurations that interpolate between these two local minima (at finite temperature).
[13] M. Spivak, A Comprehensive Introduction to Differential Geometry, Volume Two (Publish or Perish, Inc., Houston, 1999)
[14] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, Journal of Chemical Physics 21 (1953) 1087
[15] A.L. Cuff et al., Nucleic Acids Research (Advance Access published on November 7, 2008)



